
[V0] Correct CUDA Graph capture for encoder-decoder models #22630


Merged: 4 commits into vllm-project:main on Aug 12, 2025

Conversation

@Sugar-zsg (Contributor) commented Aug 11, 2025

This commit addresses a bug where CUDA Graph failed to provide performance benefits for Whisper-style encoder-decoder models.

The previous implementation capped `max_seq_len_to_capture` at the model's maximum sequence length (448), which for Whisper corresponds to the decoder. The encoder's maximum sequence length (1500) is significantly larger, so the captured graph was too small for the encoder's operations and the feature could not be properly leveraged.

The fix updates the capture logic to determine the maximum sequence length by considering both the encoder and the decoder. By ensuring the captured graph is large enough to handle both components, CUDA Graph can now be used successfully for these models, resulting in improved inference performance.
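The effect is easiest to see with concrete numbers. Below is a minimal standalone sketch of the before/after capping behavior; the 1500/448 values come from Whisper's published config, while the 8192 default and the variable names are assumptions for illustration rather than the actual vLLM code path:

    # Illustrative numbers for a Whisper-style model (assumed, not read from vLLM).
    max_source_positions = 1500    # encoder sequence length (audio frames)
    max_model_len = 448            # decoder / generation length
    max_seq_len_to_capture = 8192  # assumed default capture limit

    # Old behavior: unconditionally cap at max_model_len -> 448, too small
    # for the 1500-position encoder sequences.
    old_capture = min(max_seq_len_to_capture, max_model_len)

    # Fixed behavior: skip the cap for encoder-decoder models, leaving the
    # capture limit large enough for both encoder and decoder.
    is_encoder_decoder = True
    new_capture = (max_seq_len_to_capture if is_encoder_decoder
                   else min(max_seq_len_to_capture, max_model_len))

    assert new_capture >= max_source_positions  # graph now covers the encoder
    print(old_capture, new_capture)  # 448 8192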

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Test Plan

Test Result

(Optional) Documentation Update

This commit addresses a bug where CUDA Graph failed to provide performance benefits for Whisper-style encoder-decoder models.

The previous implementation of CUDA Graph capture incorrectly set the `max_seq_len_to_capture` based solely on the encoder's max sequence length (448). For models like Whisper, the decoder's max sequence length (1500) is significantly larger. This mismatch caused the captured graph to be too small for the decoder's operations, leading to a failure to properly leverage the feature.

The fix updates the capture logic to correctly determine the max sequence length by considering both the encoder and decoder. By ensuring the captured graph is large enough to handle both components, we can now successfully utilize CUDA Graph for these models, resulting in improved inference performance.
@gemini-code-assist (bot) left a comment


Code Review

This pull request fixes a bug with CUDA Graph capture for encoder-decoder models. The change prevents `max_seq_len_to_capture` from being incorrectly capped by a smaller model length, which is correct. I've suggested a more robust implementation that explicitly considers both encoder and decoder maximum lengths, making the fix more resilient and less dependent on default values.

Comment on lines 1168 to 1170:

    if not self.is_encoder_decoder:
        self.max_seq_len_to_capture = min(self.max_seq_len_to_capture,
                                          self.max_model_len)
Severity: high

While this change correctly addresses the issue for encoder-decoder models by not capping `max_seq_len_to_capture`, it relies on the default value being sufficiently large. This could be fragile if a user sets a smaller `max_seq_len_to_capture` or for models with very large encoder sequence lengths.

A more robust approach would be to explicitly determine the maximum length by considering both encoder and decoder sequence lengths, and then use that to cap `max_seq_len_to_capture`. This aligns better with the PR's goal to 'correctly determine the max sequence length'.

Consider this alternative implementation:

Suggested change:

    -if not self.is_encoder_decoder:
    -    self.max_seq_len_to_capture = min(self.max_seq_len_to_capture,
    -                                      self.max_model_len)
    +max_len = self.max_model_len
    +if self.is_encoder_decoder:
    +    max_len = max(
    +        max_len, getattr(self.hf_config, "max_source_positions", 0))
    +self.max_seq_len_to_capture = min(self.max_seq_len_to_capture, max_len)
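For a Whisper-like config, the suggested bound resolves to the encoder length. A small standalone sketch of how the expression would evaluate (the `max_source_positions` value and the 8192 default are assumptions for illustration, not read from vLLM):

    from types import SimpleNamespace

    # Stand-in for self.hf_config; 1500 is Whisper's encoder length (assumed here).
    hf_config = SimpleNamespace(max_source_positions=1500)
    max_model_len = 448            # Whisper decoder length
    max_seq_len_to_capture = 8192  # assumed default

    max_len = max(max_model_len, getattr(hf_config, "max_source_positions", 0))
    max_seq_len_to_capture = min(max_seq_len_to_capture, max_len)
    print(max_seq_len_to_capture)  # 1500 -> still covers the encoder sequences

With this shape, the cap still applies when a user configures a smaller `max_seq_len_to_capture`, which is the robustness the comment is asking for.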

@DarkLight1337 (Member) left a comment


Have you considered the impact of this change on mllama? It would be great if you run benchmarks on both whisper and mllama


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@Sugar-zsg (Contributor, Author) commented

> Have you considered the impact of this change on mllama? It would be great if you run benchmarks on both whisper and mllama

Thanks for the suggestion.

I have verified the impact on the Whisper-large-v3-turbo model using NVIDIA H20e GPUs. The change reduced the average inference time from approximately 320ms to 190ms for the same audio, which is a significant performance gain. This updated Whisper service has been running successfully in our production environment under high load for several weeks.

Regarding mllama, I haven't used it before, but I will start some basic performance and load tests to evaluate the impact.

@Sugar-zsg (Contributor, Author) commented

re: @DarkLight1337

Thanks again for the suggestion and for your patience.

I have completed the benchmarks for mllama and can now share the full findings.

For the meta-llama/Llama-3.2-11B-Vision-Instruct model, tested on NVIDIA H20e GPUs, the performance remained stable:

  • Before the change:
    • Token Throughput: 102.25 tokens/s
    • Average Latency: 2.504s/sample
  • After the change:
    • Token Throughput: 102.09 tokens/s
    • Average Latency: 2.508s/sample

This stability is expected, as the mllama model did not have the same CUDA Graph configuration issue that the Whisper model had. CUDA Graph was already correctly enabled for mllama, so the fix did not impact its performance.

As a reminder, this fix provided a significant performance boost for Whisper-large-v3-turbo, reducing average latency from ~320ms to 190ms. This change is stable and has been running in our production environment for weeks.

I believe this change is valuable for models like Whisper, where the encoder-decoder sequence length mismatch prevents CUDA Graph from being fully utilized.

@DarkLight1337 (Member) commented

Ok nice, LGTM then, thanks for optimizing this!

@DarkLight1337 DarkLight1337 changed the title Correct CUDA Graph capture for encoder-decoder models (V0 engine) [V0] Correct CUDA Graph capture for encoder-decoder models Aug 11, 2025
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) August 11, 2025 12:49
@github-actions (bot) added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Aug 11, 2025
@NickLucche (Contributor) left a comment


Thanks a lot for fixing this, this is great!

I feel this feature could use a unit test to make sure the V1 port #21088 correctly works with it though.
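A rough sketch of what such a test might assert, mirroring the capping logic in plain Python; the helper name and the 8192 default are illustrative, not the actual vLLM or V1 test APIs:

    import pytest

    def capture_len(max_seq_len_to_capture, max_model_len, is_encoder_decoder):
        """Mirror of the fixed capping logic (illustrative, not vLLM code)."""
        if not is_encoder_decoder:
            return min(max_seq_len_to_capture, max_model_len)
        return max_seq_len_to_capture

    @pytest.mark.parametrize("encoder_len,decoder_len", [(1500, 448)])
    def test_capture_len_covers_encoder_and_decoder(encoder_len, decoder_len):
        # 8192 is an assumed default; a real test would read it from the model config.
        got = capture_len(8192, decoder_len, is_encoder_decoder=True)
        assert got >= encoder_len
        assert got >= decoder_len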

@DarkLight1337 (Member) commented

PTAL at the failing Worker test

auto-merge was automatically disabled August 12, 2025 03:24

Head branch was pushed to by a user without write access

@Sugar-zsg (Contributor, Author) commented Aug 12, 2025

re: @DarkLight1337
I've resolved the issue with the worker test and added some comments.

@vllm-bot vllm-bot merged commit 8d17fa6 into vllm-project:main Aug 12, 2025
34 of 39 checks passed
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025